Clare Heinbaugh

11/24/2020

Gravity model overview

Gravity models allow us to model human migration and transportation. While wealthier countries have more data on human movement, low middle-income countries must rely on models to understand the movement of persons. This concerns human development because knowing how people move affects policy planning like road construction, infrastructure, and even tracking the progression of diseases. One could argue that the advent of COVID-19 as a global pandemic makes it more important than ever to develop accurate models of human movement.

The gravity model is based on the physics formula for gravity \[F_g = \frac{G \cdot m_1 \cdot m_2}{r^2}\] \(r\) references to the distance between the two bodies of interest and \(m_1\) and \(m_2\) are the two masses for which we want to model the gravitational force. \(G\) is the gravitational constant. In this case, the force of gravity is directly proportional to the masses and inversely proportional to the square of the distance.

We can apply similar logic to model human movement with a gravity model. Unlike the physics formula for gravity, there are many more variables that serve as push-pull factors for human movement. Garcia et al. suggests that geographic, sociodemographic, economic, climatic, and environmental factors all affect a person’s choice to move. For instance, in class, we discussed that gender may be an indicator of daily movement, because a woman may be more likely to take children to school than go to work. The data science gravity model allows these additional factors to be included in a gravity model with the standard variable distance and population. According to Garcia et al., the most simple model gravity model can be expressed as: \[MIG_{ij} = \frac{p_i^a \cdot p_j^b}{d_{ij}^c}\]

\(p_i\) and \(p_j\) represent the populations as the origin \(i\) and destination \(j\) and \(d_{ij}\) corresponds to the distance between the origin and destination. Here the coefficients \(a,b,c\) allow researchers to control the relative influence of each of these variables. But how do we find the best coefficients?

In class, we started our gravity model investigation with a tutorial for modeling the daily commute flows between boroughs of London. The author Dennet describes different ways to find these coefficients. First, he creates a gravity model matching each coefficient to the coefficients from physics, i.e. \(a = b = 1, c = 2\). Using flow data from the Census Support Flow Data Service, he develops a model with an \(r^2\) goodness-of-fit value of 0.51 or 51% accurate. The first method he suggests to improve these coefficient estimates, and thus the overall model, is to incrementally adjust the coefficients and calculate model accuracy until a better model is achieved. However, he proposes a third, much simpler way, to define these coefficients. He notes that taking the natural log of the above gravity model formula yields a regression model whose coefficients can be found using a Poisson regression model. \[\ln{MIG_{ij}} = a\ln{p_i}+b\ln{p_j}-c\ln{d_{ij}}\] At a high-level, we use a Poisson distribution instead of a continuous probability distribution because there are no fractions of people or negative people. Since the Poisson distribution is discrete, it works well as a gravity model for human movement. The \(gravity\) package for \(R\) wraps these model calculations.

The goal of my project was to construct a gravity model for flows between adm1’s in Eswatini between the years 2010 and 2015. I also created various animations to illustrate the movement of people. Finally, I decided to use my gravity model to attempt to model the flows for Pigg’s Peak, an adm2 in Eswatini. Garcia et al.’s research also focused on adding additional explanatory variables to improve the model. They found that the best predictors of human migration for their research using microdata at the national level for 10 countries were: the urban proportion of the population, economic activity, and the male proportion of the population. The gravity model that I created also contains additional variables to improve the model like nighttime lights and total populations in each adm1. These were the variables I had access to as raster data from WorldPop.

Data preprocessing

For reference throughout this report, here is a plot of Eswatini at the national level with each adm1 labeled.
*Figure 1:* Total origin flows represented by size of centroid (left) and total destination flows represented by size of centroid (right)

Figure 1: Total origin flows represented by size of centroid (left) and total destination flows represented by size of centroid (right)

Furthermore, the following mapping is sometimes used:

WorldPop provides flow data about country-specific migration flows. I downloaded a csv corresponding to Eswatini’s flow patterns from 2010. I then read this file into R using tidyverse’s function read_csv. To map each of the values from 1 to 4 to the correct adm1, I plotted each of the centroids that I read in using the function read_sf.

The next step is to create an origin destination matrix and a distance matrix. The origin destination matrix looks can be generated using the command pivot_wider, a tidyverse command to expand long data into wide data. This pivoting operation is the opposite of that used in project 2 to pivot our household data to individual data.

flow_matrix <- flow_data %>% 
            dplyr::select(NODEI, NODEJ, PrdMIG) %>% 
            pivot_wider(names_from = NODEJ, values_from = PrdMIG) %>%
            dplyr::select(`1`, everything(), -NODEI)

The resulting matrix looks like this.

Hhohho Lubombo Manzini Shiselweni
NA 1318.984 1983.198 308.5253
1371.8446 NA 1714.392 685.9483
1916.4960 1587.818 NA 909.7113
566.1081 1205.966 1761.763 NA

Now we can create a matrix representing distances from each origin to destination. The package geosphere has a function distHaversine that allows us to calculate distances by assuming a spherical earth.

We can use a for-loop to calculate distances for all of our rows in flow_data and then append these distances to a new vector which gets appended to flow_data.

x <- c(1:nrow(flow_data))
distances <- c()
for(val in x) {
  distances <- append(distances, distm(c(flow_data[val, ]$LONFR, flow_data[val, ]$LATFR), c(flow_data[val, ]$LONTO, flow_data[val, ]$LATTO), fun = distHaversine))
}
flow_data <- flow_data %>% add_column(distances)

Then we can pivot our flow data again to create a distance matrix.

Hhohho Lubombo Manzini Shiselweni
NA 72470.36 52365.24 109707.03
72470.36 NA 60782.41 71377.57
52365.24 60782.41 NA 60903.19
109707.03 71377.57 60903.19 NA
Now we turn to origins and destinations. Let us begin by visualizing the total number of people leaving and coming to each adm1.
*Figure 2:* Total origin flows represented by size of centroid (left) and total destination flows represented by size of centroid (right)*Figure 2:* Total origin flows represented by size of centroid (left) and total destination flows represented by size of centroid (right)

Figure 2: Total origin flows represented by size of centroid (left) and total destination flows represented by size of centroid (right)

Comparing the two plots, we notice that the number of incoming and outgoing people is fairly consistent. One could note, however, that more people appear to be coming into Hhohho and Lubombo than leaving because the corresponding points are larger for the destination plot compared to the origin plot.

Visualizing the flows

Now that we have extracted the flow data and compared the total origin and destination flows, we can look more closely at how the agents are moving between each adm1. We can accomplish this with a series of animated visualizations.

Let us start most simply by only looking at the flow from Hhohho to all other adm1’s. We start with a data frame flow_data that we defined earlier. At this point it looks like this:

ISO NODEI NODEJ LONFR LATFR LONTO LATTO PrdMIG distances
SWZ 1 2 31.32644 -26.08605 31.84264 -26.54401 1318.9845 72470.36
SWZ 1 3 31.32644 -26.08605 31.23230 -26.54882 1983.1981 52365.24
SWZ 1 4 31.32644 -26.08605 31.42765 -27.06740 308.5253 109707.03
SWZ 2 1 31.84264 -26.54401 31.32644 -26.08605 1371.8446 72470.36
SWZ 2 3 31.84264 -26.54401 31.23230 -26.54882 1714.3924 60782.41
SWZ 2 4 31.84264 -26.54401 31.42765 -27.06740 685.9483 71377.57
SWZ 3 1 31.23230 -26.54882 31.32644 -26.08605 1916.4960 52365.24
SWZ 3 2 31.23230 -26.54882 31.84264 -26.54401 1587.8183 60782.41
SWZ 3 4 31.23230 -26.54882 31.42765 -27.06740 909.7113 60903.19
SWZ 4 1 31.42765 -27.06740 31.32644 -26.08605 566.1081 109707.03
SWZ 4 2 31.42765 -27.06740 31.84264 -26.54401 1205.9658 71377.57
SWZ 4 3 31.42765 -27.06740 31.23230 -26.54882 1761.7632 60903.19

Since we only want to look at the flows originating from Hhohho, we define a new variable origin_flows.

origin_flows <- st_as_sf(flow_data, coords = 4:5, crs = st_crs(swz_adm1))

To draw lines between Hhohho and all other adm1’s we find the centroids for each adm1 and convert them to geometries. Then, we generate all possible combinations of centroids with expand.grid.

adm1_centroids <- st_centroid(swz_adm1) %>% st_geometry()
od_combos <- expand.grid(adm1_centroids, adm1_centroids)

Next, we want to pull out all destinations from one origin point and make this a new connected geometry object. We can connect these as lines from Hhohho’s centroid to all other adm1 centroids. Finally, we include migration data and subset down to Hhohho as our origin with ID 1.

pt_1 <- st_union(od_combos$Var2[1], od_combos$Var1[2:4])
ln <- st_cast(pt_1, "LINESTRING") %>% 
  st_as_sf()
ln <- ln %>%
  add_column(subset(origin_flows, NODEI == 1) %>%
               st_set_geometry(NULL) %>%
               dplyr::select(PrdMIG))
Below is a plot of flows from Hhohho to all other adm1 centroids with line weights corresponding to the number of people traveling from Hhohho to each adm1.
*Figure 3:* Flows from Hhohho to each `adm1` indicated by line weight

Figure 3: Flows from Hhohho to each adm1 indicated by line weight

More people are traveling to Lubombo and Manzini as opposed to Shiselweni. As predicted by the gravity model, flow appears to be inversely related to distance; distance is a deterrent for movement.

Next, we want to animate one person moving from Hhohho to Lubombo. We can simply select points along our first line from Hhohho to Lubombo and establish a time variable that corresponds to how much time (scaled) passes as the point moves.
*Figure 4:* A single person moving from Hhohho to Lubombo

Figure 4: A single person moving from Hhohho to Lubombo

Then, we want to animate people moving to and from all adm1’s to all other adm1’s. In order to create the lines that each point moves along, we repeat the following code for each of the 12 lines connecting adm1’s to and from each other.

start = c(flow_data$LONFR[1], flow_data$LATFR[1])
end = c(flow_data$LONTO[1], flow_data$LATTO[1])
pt_matrix <- rbind(start, end)
new_line1 <- st_linestring(pt_matrix)

While this method is much clunkier than using st_union on the centroids, it ensures that the linestrings correspond to the correct origin, destination, and flow. We store the complete lines in a new variable lns_all.

lns_all <- st_sfc(new_line1, new_line2, new_line3, new_line4,
                   new_line5, new_line6, new_line7, new_line8,
                   new_line9, new_line10, new_line11, new_line12,
                   crs = st_crs(swz_adm1)) %>% st_as_sf()

One may suggest that we replace the hardcoded line calculations (i.e. start, end, etc.) with a for-loop. However, sf does not like appending linestrings to a simple feature collection programmatically. Instead, all lines must be created and added at once using the command st_sfc. Realizing there was no easy way to put the linestrings in the correct order without hardcoding took me much longer than I would like to admit.

Again, we sample all points along the line repeated 20 times for the 20 points that fall between each origin and destination to give our moving people paths to follow. Here is the resulting animation.
*Figure 5:* People moving to and from all `adm1`'s

Figure 5: People moving to and from all adm1’s

While this animation is nice to look at, we still don’t have a clear picture of the flows to and from each adm1. We have the flow_matrix which shows the number of people in matrix form moving between each adm1. Figure 3 has lines extending from Hhohho with weights corresponding to the flow from Hhohho to every other adm1. To show national flows between all adm1’s, we can add the flow variable PrdMIG from flow_data to our points.

We start by creating a new vector PrdMIG and repeating the flows 20 times so that it will match our animation setup of repeating each moving point 20 times along our lines.

PrdMIG <- c()
for(i in 1:nrow(origin_flows)){
  PrdMIG <- c(PrdMIG, rep(origin_flows$PrdMIG[i], each = 20))
}

We then scale the flows so that they can be reasonable represented as point sizes in our animation.

p_all$size <- PrdMIG / 1000 
Here is the resulting animation.
*Figure 6:* Flows at the national level represented by point size

Figure 6: Flows at the national level represented by point size

There are many possible animation improvements. First, I could have used a scaled number of points to represent the amount of flow between centroids, but I chose instead to scale the size of single points according to the flows. This conveys the same information as adding additional points. Furthermore, the time variable could be adjusted to 5 to correspond to the 5 years during which people moved. In the next section, I will create a gravity model and compare its predicted simulation of migration to this actual simulation of migration from WorldPop data.

Predicting flows at the national level

Now that we have created an animation for the flows at the national level based on WorldPop adm1 flow data, we want to create a gravity model that we will later apply to our selected adm2 in Eswatini: Pigg’s Peak.

For now, our gravity model will be based only on distance between each adm1 and fit to the WorldPop flows. We use the gravity package as a wrapper for our gravity model and use the function ppml which, according to the documentation, “estimates gravity models in their multiplicative form via Poisson Pseudo Maximum Likelihood.” We are required to provide an additional variable for prediction. flow_data$var <- 1 serves as a placeholder. We provide the following arguments to ppml.

fit <- ppml(
  dependent_variable = "PrdMIG",
  distance = "distances",
  additional_regressors = c("var"),
  robust = TRUE,
  data = flow_data
)
Calling fitted(fits) outputs the predicted flows to and from each adm1. We can use fitted(fits) to create a new animation to show the modeled flows. In Figure 7 we show the actual flows and predicted flows side-by-side.
*Figure 7:* Compare actual flows at the national level (left) to modeled flows (right)*Figure 7:* Compare actual flows at the national level (left) to modeled flows (right)

Figure 7: Compare actual flows at the national level (left) to modeled flows (right)

Notice how the points moving along the same line are the same size; distance is the only predictor at the moment and thus the same as long as the origin and destination centroids are the same but opposite. To remedy this and improve our model, we can add in two more flow predictors: population and nighttime lights. For each row with an origin and destination, we want a corresponding origin adm1 population and destination adm1 population and nighttime lights for both origin and destination.

The long (and arguably ugly) way to accomplish this is to crop and mask down to each adm1. Then we can sum and floor the night time lights or population raster data from WorldPop. Here is an example for Hhohho’s nighttime lights.

hhohho <- swz_adm1 %>%
  filter(NAME_1 == "Hhohho")
hhohho_lights <- crop(night_lights, hhohho) %>%
    mask(hhohho) %>%
    cellStats("sum") %>%
    floor()

Once we have the total nighttime lights from each adm1, we can create a new vector, use expand.grid to yield all possible combinations, and add it to flow_data.

all_lights <- c(hhohho_lights, lubombo_lights, manzini_lights, shiselweni_lights)
all_lights_expanded <- expand.grid(all_lights, all_lights) %>%
                       filter(Var1 != Var2)
flow_data <- add_column(flow_data, night_lights_ori = all_lights_expanded$Var2) %>%
             add_column(night_lights_des = all_lights_expanded$Var1)

We also want to filter out combinations that yield the same night time lights as they are mapping the same adm1 to itself, hence the command filter(Var1 != Var2).

A faster method to accomplish the same outcome is shown here for population.

# Get raster information for population from WorldPop
swz_pop20_raster <- raster("world_pop/swz_ppp_2020.tif")

# Use cluster to extract raster information faster
beginCluster(n = detectCores() - 1)
swz_pop20 <- raster::extract(swz_pop20_raster, swz_adm1, df = TRUE)
endCluster()

# Add up the population at each cell that corresponds to a given `adm1`å
agg_pop20 <- swz_pop20 %>% 
             group_by(ID) %>%
             summarize(pop20 = sum(swz_ppp_2020, na.rm = TRUE)) 
agg_pop20_expanded <- expand.grid(agg_pop20$pop20, agg_pop20$pop20) %>% 
                      filter(Var1 != Var2)
flow_data <- add_column(flow_data, pop_ori = agg_pop20_expanded$Var2) %>%
            add_column(pop_des = agg_pop20_expanded$Var1)
Inputting these new variables into our PPML yields the following plot (right) compared to the actual flow plot (left).
*Figure 8:* Actual flows (left) and improved model flows (right)*Figure 8:* Actual flows (left) and improved model flows (right)

Figure 8: Actual flows (left) and improved model flows (right)

Our new model matches reality much better just qualitatively comparing the flows here. While the distance variable scaled the actual flow sizes accordingly, nighttime lights and populations scaled the relative flow sizes.

Applying the national model to Pigg’s Peak

Now we want to zoom in to our selected adm2, Pigg’s Peak.

Combining the results of projects 1 and 2 and calculating voronoi polygons given the urban area centroids yields the following plot.
*Figure 9:* Pigg's Peak with urban areas (cyan) and associated vornoi polygons

Figure 9: Pigg’s Peak with urban areas (cyan) and associated vornoi polygons

Urban areas are enclosed in polygons that are defined based on the densest areas of Pigg’s Peak. Voronoi polygons are calculated using the command st_voronoi. A voronoi polygon is defined such that every point within the polygon is closer to the enclosed centroid compared to any other. Now we can simply repeat the process that we used at the national level but replace the adm1 polygons with our voronoi polygons. The final data frame that we will pass into our national gravity model looks like this.

origin destination distance pop_ori pop_des night_lights_ori night_lights_des dist_log model_flows NODE_FROM NODE_TO
MULTIPOINT ((31.25902 -25.9… MULTIPOINT ((31.19135 -26.0… 11331.023 2174.889 7565.999 2174.889 7565.999 9.335300 45712.46 1 2
MULTIPOINT ((31.25902 -25.9… MULTIPOINT ((31.27092 -26.0… 6405.440 2174.889 2907.335 2174.889 2907.335 8.764903 132001.91 1 3
MULTIPOINT ((31.25902 -25.9… MULTIPOINT ((31.29575 -25.9… 6886.757 2174.889 2514.279 2174.889 2514.279 8.837356 115082.57 1 4
MULTIPOINT ((31.25902 -25.9… MULTIPOINT ((31.36488 -26.0… 11567.381 2174.889 2046.735 2174.889 2046.735 9.355944 43368.76 1 5
MULTIPOINT ((31.25902 -25.9… MULTIPOINT ((31.21392 -26.0… 8150.758 2174.889 2490.216 2174.889 2490.216 9.005866 83835.55 1 6

Only the first 5 rows are shown which correspond to all combinations that start at the first polygon’s centroid and go to all other poylgons. The columns here must have the exact same names as the columns of flow_data to make our prediction. The log of the distances stored in dist_log is also required to predict flows.

1 2 3 4 5 6
NA 45712.46 132001.91 115082.57 43368.76 83835.55
44879.49 NA 78244.16 18423.15 18935.63 477493.75
131672.52 79497.51 NA 38260.83 62122.39 162671.89
114949.41 18743.38 38312.17 NA 37252.65 26510.49
43387.73 19295.51 62305.03 37312.11 NA 25785.30
83745.42 485833.20 162903.52 26512.67 25746.32 NA

From this flow matrix, we see that the flows are much too large; the population of Pigg’s Peak was only 19,699 in 2015 and we had some predicted flows that were over ten times as large as this population. This is likely because we attempted to apply a model that was trained at the national scale with little data, not much variation in distances, and no distances on the scale of that seen in Pigg’s Peak, to a region composed of more subdivisions with much closer distances. As we saw comparing the gravity model trained at the national level on just distances compared to the one trained on distances, nighttime lights, and population, the distances scaled the actual flows and the nighttime lights and population variables only served to adjust the relative flow sizes. Thus, we cannot expect the nighttime light and population variables provided at the adm2 level to scale down our flows, only perhaps to adjust their proportions relative to each other. Thus, perhaps how the flows compare to each other may be accurate, but the actual flows certainly are not. Why attempt this investigation at all then? We showed that we could make model predictions using the gravity model trained at the national level. It would likely yield better results applied to another region near and of similar size to Eswatini.

Another reason our model may not have translated from the national level to the adm2 level is the time component. At the national level, the flow data was recorded for a period of five years. However, depending on a person’s job, school, etc., one may travel from one region of Pigg’s Peak to another in a day. Thus, the time variable would need to be adjusted to reflect this difference perhaps to a day or a few days.

Project Code: https://github.com/ceh-2000/ABM/blob/master/scripts/gravity_model.R